Regex support for hexadecimal and unicode escapes #1341
Added Java/JS support for regular expressions with hexadecimal and Unicode escape sequences.
@lmasroca @jgaleotti thanks for this PR! But I'm a bit confused about EOF. I'm not saying it is wrong, but I don't understand why it needed to be added, or what side effects it could have.
    // Parser rules have first letter in lower-case

    - pattern : disjunction;
    + pattern : disjunction EOF;
Why this EOF?
How would it work when dealing with strings that don't have it?
    // Parser rules have first letter in lower-case

    - pattern : disjunction;
    + pattern : disjunction EOF;
see previous comment
    - val gene = RegexGene("regex", disjList,"${RegexGene.JAVA_REGEX_PREFIX}$text")
    + // we remove the <EOF> token from end of the string to store as sourceRegex
    + val gene = RegexGene("regex", disjList,"${RegexGene.JAVA_REGEX_PREFIX}${text.substring(0,text.length - EOF_TOKEN.length)}")
what if the text does not have EOF?
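One defensive way to address this concern is to strip the marker only when the text actually ends with it. The sketch below is in Java rather than the PR's Kotlin, and it assumes `EOF_TOKEN` is the literal string `<EOF>`; the constant's definition is not shown in the diff, and `stripEof` is a hypothetical helper name, not code from the PR.

```java
// Sketch only: not the PR's code. EOF_TOKEN's value is an assumption.
final class EofSuffix {
    // Assumed text of the end-of-file token as it appears at the end of the parsed text.
    static final String EOF_TOKEN = "<EOF>";

    // Removes the trailing EOF marker if present; otherwise returns the text unchanged.
    static String stripEof(String text) {
        return text.endsWith(EOF_TOKEN)
                ? text.substring(0, text.length() - EOF_TOKEN.length())
                : text;
    }

    public static void main(String[] args) {
        System.out.println(stripEof("a|b<EOF>")); // prints a|b
        System.out.println(stripEof("a|b"));      // prints a|b
    }
}
```

An unconditional `substring` would instead cut real characters from a text without the marker, or throw when the text is shorter than the marker.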
    - val gene = RegexGene("regex", disjList,"${RegexGene.JAVA_REGEX_PREFIX}$text")
    + // we remove the <EOF> token from end of the string to store as sourceRegex
    + val gene = RegexGene("regex", disjList,"${RegexGene.JAVA_REGEX_PREFIX}${text.substring(0, text.length - EOF_TOKEN.length)}")
see previous comment
By default, ANTLR4 tries to match as much input as possible according to the grammar rules. Without EOF at the end of the start rule, it may stop parsing after the longest valid match and silently ignore the rest. Adding EOF forces it to consume the entire input, which helps detect leftover or invalid tokens. This was needed for tests that intentionally feed invalid input. Regarding side effects: a regex containing invalid or unsupported syntax now causes an exception instead of having part of it silently dropped. See https://github.com/antlr/antlr4/blob/master/doc/parser-rules.md#start-rules-and-eof
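The difference between accepting a valid prefix and requiring the whole input can be illustrated with an analogy from `java.util.regex`. This is not ANTLR code: `lookingAt()` stands in for a start rule without EOF, `matches()` for a start rule ending in EOF, and the class and method names are made up for the illustration.

```java
import java.util.regex.Pattern;

// Analogy only: uses java.util.regex, not an ANTLR-generated parser.
final class PrefixVsFullMatch {
    // Stand-in "grammar": one or more lowercase letters.
    private static final Pattern LETTERS = Pattern.compile("[a-z]+");

    // Like a start rule without EOF: succeeds if a prefix of the input matches.
    static boolean acceptsPrefix(String input) {
        return LETTERS.matcher(input).lookingAt();
    }

    // Like a start rule ending in EOF: succeeds only if the whole input matches.
    static boolean acceptsWholeInput(String input) {
        return LETTERS.matcher(input).matches();
    }

    public static void main(String[] args) {
        String input = "abc)"; // valid prefix followed by a leftover character
        System.out.println(acceptsPrefix(input));     // prints true
        System.out.println(acceptsWholeInput(input)); // prints false
    }
}
```

The trailing `)` is silently ignored in the first case and rejected in the second, which mirrors the behaviour change described above.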
Merged into #1349 to be able to run CI on it.
External pr lmasroca from #1341
Added support for short hexadecimal escapes (\x00..\xff) and unicode escapes (\u0000..\uffff) for Java and JavaScript regular expressions.
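For reference, this is what the two escape forms mean in Java's own regex engine. The sketch uses only `java.util.regex.Pattern` and does not show how the PR's grammar or `RegexGene` handle them; `EscapeDemo` and `fullMatch` are names invented for the example.

```java
import java.util.regex.Pattern;

// Illustration of the escape forms in java.util.regex; not code from the PR.
final class EscapeDemo {
    // True if the whole input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Backslashes are doubled so the regex engine, not the Java compiler,
        // sees the escapes. Hex escape 41 is 'A'; four-digit escape 0042 is 'B'.
        String regex = "\\x41\\u0042";
        System.out.println(fullMatch(regex, "AB")); // prints true
        System.out.println(fullMatch(regex, "AC")); // prints false
    }
}
```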